Introduction

Life expectancy is a measure of how long an individual is expected to live on average and is commonly used in designing policy or even as a social indicator to evaluate the quality of life for any given region (see “An Overarching Health Indicator for the Post-2015 Development Agenda” 2014; also “An Overarching Health Indicator for the Post-2015 Development Agenda,” n.d.).

The goal of this project is to develop a model that predicts life expectancy in Baltimore at the resolution of a single street block, with estimates of uncertainty. The hope is that this new information will let us better examine which factors contribute to life expectancy for any given block in any given neighborhood in Baltimore city, and so aid decision making when policy changes are being implemented.

We have data from the city of Baltimore that give estimates of life expectancy at the Community Statistical Area (CSA) level. CSAs are used because the boundaries and the names of the 270+ neighborhoods in Baltimore may change over time, so the CSA provides a consistent way to characterize a particular region of the city. Each CSA is made up of several neighborhoods, and a neighborhood may belong to more than one CSA; that is, the boundary of a CSA may cut through a neighborhood.

Since our outcome, life expectancy, is available only at an aggregate level, this project aims to provide street block level predictions of life expectancy using statistical downscaling methods.

Data

We have data from the Baltimore city website, the Baltimore Neighborhood Indicators Alliance (BNIA-JF), the Maryland Department of Planning, and the Census Bureau. The data consist of life expectancy estimates for each neighborhood, along with crime, economic development, and education information, all over a five-year period (2010-2014). I also have street level and block group level data.

The data fall in three general categories.

  1. Street level
  2. Block group level
  3. Community statistical area level (CSA)

Data Cleaning and interpolation

The following table gives some of the variables used in the model fitting process, the level at which we originally obtained the data, and the assumptions we made to get it at the street block level.

| Variable | Name | Level | Cleaning steps |
|---|---|---|---|
| propfemhh | Proportion of households headed by a female with related children under 18 years | Block group | Since the data were at the block group level and we wanted street block level data, we used Kriging to interpolate values at new locations (street block locations) from the block group level information. The location of each street block was taken as the median longitude and latitude of all streets that made up the street block. One assumption made here is that the distribution of the variable (propfemhh) is smooth, in the sense that nearby street blocks tend to have similar values. |
| propkids_withinsurance | Proportion of individuals less than 18 years old who have health insurance for a given block group | Block group | Here we used the block group value as the value for each street block in that block group. The assumption is that block groups tend to be quite homogeneous with regard to this variable. |
| racdiv | Racial diversity as calculated per block group | Block group | This variable was not given but was estimated from the block group data on race. It is calculated as follows: calculate the percent of each race (as a proportion), square the percent for each group, sum the squares, and subtract the sum from 1.00. Eight groups were used for the index: White, not Hispanic; Black or African American; American Indian and Alaska Native (AIAN); Asian; Native Hawaiian and Other Pacific Islander (NHOPI); two or more races, not Hispanic; some other race, not Hispanic; Hispanic or Latino. This method is based on that used by the Census Bureau. More information can be found here. We decided not to interpolate these values for the street blocks but instead used the values from the block groups that they belonged to. This was done because of the unique structure of neighborhoods in Baltimore city. |
| propbelow | Proportion of individuals within a block group who live below the poverty line | Block group | To get the data at the street block level we interpolated values from the block group level. The assumption used here is that the further into a particular neighborhood you go, the more representative each block is of the aggregate level data for this variable. |
| mhhi | Median household income | Block group | We interpolated the values for the street blocks from the block group level data using Kriging. Again, this is based on the assumption that the further into a particular neighborhood you go, the more representative each block is of the aggregate level data for this variable. |
| totalincidents | Number of crime incidents per street | Street | We aggregated this to get the number of crimes committed per street block. |
| prop.vacant | Proportion of vacant homes | Street | We divided the number of vacant homes per street block by the total number of homes in that street block. |
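
The racial diversity calculation described for racdiv can be made concrete with a small sketch. This is written in Python rather than the R used for the analysis, and it is only an illustration of the stated formula (one minus the sum of squared group shares); the function name `diversity_index`, the dictionary keys, and the counts are hypothetical and are not taken from the project data.

```python
def diversity_index(counts):
    """Return 1 - sum(p_k^2), where p_k is each group's share of the total."""
    total = sum(counts.values())
    if total == 0:
        return float("nan")  # no residents, so the index is undefined
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical block group with the eight groups listed in the text.
# The counts are made up for illustration only.
example_counts = {
    "white_nh": 400,
    "black": 400,
    "aian": 0,
    "asian": 100,
    "nhopi": 0,
    "two_or_more_nh": 0,
    "other_nh": 0,
    "hispanic": 100,
}

# Shares are 0.4, 0.4, 0.1, 0.1, so the index is 1 - (0.16 + 0.16 + 0.01 + 0.01).
racdiv_example = diversity_index(example_counts)
```

A block group made up of a single group scores 0, and the index grows as residents are spread more evenly across groups.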

The rest of the variables used in the final model are: Percentage of Students Suspended or Expelled During School Year (susp); Liquor Outlet Density per 1,000 Residents (liquor); and Percent of Residences Heated by Electricity (elheat). Note that these three variables were observed at the CSA level. I did not interpolate them to the street block level because I felt that the assumptions inherent in the process would be untenable.

Descriptives

Since the goal of this analysis is to predict life expectancy at the street block level, and since the block information contained in the dataset was not properly defined, I made a couple of plots to see what was a census block and what was a street block.

Furthermore, since some of the data files have information on neighborhood blocks, I plotted the neighborhood information as delineated by the block level data from the Baltimore city website and then overlaid the neighborhood data from the Maryland Department of Planning. Using information from the Baltimore gisdata website, I was also able to find out how a “block” was actually defined. All of this points to the possibility of using blocks from our dataset as street blocks.

Here the colored points are the street blocks, while each grid cell represents a census block. For more plots examining the fits please visit my github repo

All of this indicates a good fit. I also used GIS data from the Baltimore city website and found that each block was defined as a street block.

Figure: An example of a city block pulled from the dataset.

Analysis

Checking for Spatial correlation

Since we have spatial data, I ran both the Mantel test (cf. Mantel 1966) and Moran’s I (cf. Moran 1950) to examine whether spatial autocorrelation exists in this dataset. Please note that while both tests measure spatial autocorrelation, they refer to quite different concepts.

Mantel’s test (Mantel 1966; Dutilleul et al. 2000) measures the correlation between two distance matrices; that is, it judges whether closeness in one set of variables is related to closeness in another set of variables. In our dataset, we can use it to see whether CSAs that are close in geographic location are also close in life expectancy, i.e., to test whether the distance matrix based on life expectancy values is correlated with the distance matrix based on spatial location for the CSAs.

## Monte-Carlo test
## Call: ade4::mantel.randtest(m1 = csa.dists, m2 = le14.dists, nrepet = 9999)
## 
## Observation: 0.1308348 
## 
## Based on 9999 replicates
## Simulated p-value: 0.0249 
## Alternative hypothesis: greater 
## 
##       Std.Obs   Expectation      Variance 
##  2.0485777370 -0.0001400827  0.0040876281

Based on these results, we can reject the null hypothesis that these two matrices, spatial distance and life expectancy distance (2014), are unrelated at alpha = 0.05. The observed correlation, r = 0.13, suggests that the matrix entries are positively associated: pairs of CSAs that are close to each other generally show smaller differences in life expectancy than pairs that are far apart. Note that since this test is based on random permutations, the same code will always arrive at the same observed correlation but rarely the same p-value. Furthermore, I ran this test for all four years in the dataset and the conclusions are consistent. If you are interested in the correlation values for those years here is the code.

Moran’s I (Moran 1950) is useful when one wants to know the correlation of a variable with itself through space, i.e., the extent to which the occurrence of an event in one areal unit makes the occurrence of an event in a neighboring areal unit more or less likely. For example, if life expectancy is low in one CSA in the north, are we likely to see low life expectancy in the neighboring CSAs? The null hypothesis is that there is no spatial autocorrelation.
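
To make the permutation logic of the Mantel test concrete, here is a minimal, standard-library Python sketch. It is not the `ade4::mantel.randtest` code that produced the output above; the helper names, the toy coordinates, and the toy life expectancy values are all hypothetical, and the p-value uses the common `(hits + 1) / (n_perm + 1)` convention, which is an assumption about how one might count permutations rather than a statement about any package.

```python
import math
import random


def upper_triangle(mat, order=None):
    """Entries above the diagonal, optionally after relabelling rows/columns."""
    n = len(mat)
    idx = order if order is not None else list(range(n))
    return [mat[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(d1, d2, n_perm=999, seed=0):
    """One-sided (greater) Mantel test for two symmetric distance matrices."""
    rng = random.Random(seed)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    order = list(range(len(d2)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel the units of the second matrix
        if pearson(x, upper_triangle(d2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: six units on a line whose outcome rises steadily with position.
positions = [0, 1, 2, 3, 4, 5]
life_exp = [70, 71, 72, 73, 74, 75]
space_dists = [[abs(a - b) for b in positions] for a in positions]
outcome_dists = [[abs(a - b) for b in life_exp] for a in life_exp]
mantel_r, mantel_p = mantel_test(space_dists, outcome_dists)
```

Because only the labels of one matrix are shuffled, the observed correlation is fixed by the data, while the p-value depends on the random permutations drawn (and hence on the seed).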

## $observed
## [1] 0.08081236
## 
## $expected
## [1] -0.01851852
## 
## $sd
## [1] 0.0174684
## 
## $p.value
## [1] 1.298067e-08

Based on these results, we can reject the null hypothesis that there is zero spatial autocorrelation present in life expectancy at the 5% level of significance. For more tests using data from 2011 to 2014 please check here.
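
For readers who want to see what Moran’s I computes, the following is a small standard-library Python sketch of the usual formula, \(I = \frac{n}{S_0}\frac{\sum_{i \ne j} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_i (x_i-\bar{x})^2}\), with \(S_0\) the sum of the weights. The weight matrix and values below are hypothetical (a simple chain of four neighbors), and this is not the code that produced the output above. The expected value under the null is \(-1/(n-1)\); the reported expectation of -0.01851852 is consistent with n = 55 areal units.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    s0 = sum(weights[i][j] for i, j in pairs)  # total weight
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    return (n / s0) * cross / sum(d * d for d in dev)


def expected_morans_i(n):
    """Expected value of Moran's I under no spatial autocorrelation."""
    return -1.0 / (n - 1)


# Hypothetical chain of four areal units: each unit neighbors the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_trend = morans_i([1, 2, 3, 4], chain)          # similar neighbors
i_alternating = morans_i([1, -1, 1, -1], chain)  # dissimilar neighbors
```

Values above the null expectation indicate that neighboring units tend to be alike; values below it indicate that neighbors tend to differ.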

Regression Model for spatial data

## [using ordinary kriging]
## [using ordinary kriging]

Geographically Weighted Regression (GWR)

  • The structure of the model does not remain constant over the study area (there are local variations in the parameter estimates)
  • To account for this potential spatial heterogeneity we use the GWR model (Fotheringham, Brunsdon, and Charlton 2002)
  • GWR permits the parameter estimates to vary locally.

This model uses a weighted least squares (WLS) approach to account for spatial heterogeneity. For observation i the model is \[Y_i = X_i\beta_i +\epsilon_i\] where \(X_i\) is the row of predictors for observation i and \(\beta_i\) is estimated by WLS: \[\hat{\beta}_i = (X^TW_iX)^{-1}X^TW_iY\] Here \(W_i\) is the spatial weight matrix, which is based on the distances between location i and every observation. Using the approach of Fotheringham, Brunsdon, and Charlton (2002), \(W_i = W(u_i,v_i)\) is an \(n \times n\) diagonal matrix denoting the spatial weighting of each observation point for model calibration point i at location \((u_i,v_i)\). It is specified by three choices: 1. the type of distance function used, e.g. the great-circle distance; 2. the kernel function, that is, how distances are converted into weights; and 3. the bandwidth, i.e. how large a neighborhood to use for each local fit. If we use the Gaussian kernel, the jth diagonal element of \(W(u_i,v_i)\) is \[w_{ij} = \exp\left(-\dfrac{1}{2}\left(\dfrac{d_{ij}}{b}\right)^2\right)\] where \(d_{ij}\) is the distance between locations i and j and \(b\) is the bandwidth.
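
The local fit can be illustrated with a short standard-library Python sketch. It assumes a single predictor plus an intercept (so the WLS solution has a closed form), Euclidean rather than great-circle distance, and made-up coordinates and data; the function names are hypothetical and this is not the GWmodel implementation used in the analysis.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel: exp(-0.5 * (d / b)^2)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_wls(x, y, weights):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(weights)
    swx = sum(w * a for w, a in zip(weights, x))
    swy = sum(w * b for w, b in zip(weights, y))
    swxx = sum(w * a * a for w, a in zip(weights, x))
    swxy = sum(w * a * b for w, a, b in zip(weights, x, y))
    b1 = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def gwr_fit_at(point, coords, x, y, bandwidth):
    """Local coefficients at `point`, weighting observations by distance."""
    weights = [gaussian_weight(math.dist(point, c), bandwidth) for c in coords]
    return local_wls(x, y, weights)


# Hypothetical data: the slope is 1 in the west cluster and 3 in the east.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x_obs = [1, 2, 3, 1, 2, 3]
y_obs = [1, 2, 3, 3, 6, 9]
_, slope_west = gwr_fit_at((1, 0), coords, x_obs, y_obs, bandwidth=2.0)
_, slope_east = gwr_fit_at((11, 0), coords, x_obs, y_obs, bandwidth=2.0)
```

With a bandwidth of 2, observations in the far cluster receive almost no weight, so the two calibration points recover slopes close to 1 and 3 respectively. A much larger bandwidth would pull both local fits toward a single global regression.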

Model selection

Data

First I divided my data into a training and a testing dataset. All the model selection procedures were then performed on the training dataset. To obtain candidate models for the GWR method, I used stepwise model selection with ordinary least squares regression under four scenarios:

| | Criteria = AIC | Criteria = BIC |
|---|---|---|
| Force variables in model selection = Yes | st1 | st2 |
| Force variables in model selection = No | st3 | st4 |

Here the forced variables are the aggregated CSA level variables for which we have street block level information: propfemhh, totalincidents, and prop.vacant. From these four models, I then selected the one that minimised the prediction error under 5-fold cross-validation, i.e., the model with the best prediction performance.
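
The idea of choosing among candidate models by cross-validated prediction error can be sketched as follows. This standard-library Python example compares two hypothetical single-predictor models by 5-fold root mean squared prediction error; the data are simulated, the names are made up, and the error measure is an assumption for illustration rather than a description of the exact criterion reported in the output below.

```python
import math
import random


def fit_simple_ols(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def kfold_rmse(x, y, k=5, seed=0):
    """Root mean squared prediction error from k-fold cross-validation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
        sq_errors += [(y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Simulated data: one predictor drives the outcome, the other is pure noise.
rng = random.Random(1)
informative = [i / 10 for i in range(40)]
uninformative = [rng.random() for _ in range(40)]
outcome = [70 + 2 * g + rng.gauss(0, 0.5) for g in informative]

scores = {
    "informative": kfold_rmse(informative, outcome),
    "uninformative": kfold_rmse(uninformative, outcome),
}
best = min(scores, key=scores.get)  # candidate with the lowest CV error
```

Each observation is predicted by a model that never saw it during fitting, so the score reflects out-of-sample performance rather than in-sample fit.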

## 
## 5-fold CV results:
##   Fit       CV
## 1 st1 2.275639
## 2 st2 2.370945
## 3 st3 1.994983
## 4 st4 1.941451
## 
## Best model:
##    CV 
## "st4" 
## 
## Call:
## lm(formula = lifeexp ~ propfemhh + propbelow.pred + susp + liquor + 
##     elheat, data = dat$train_data[, -c(1, 29:30)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9084 -1.4544 -0.5442  1.1563  5.5734 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     73.7050     0.3542 208.067  < 2e-16 ***
## propfemhh       -1.9976     0.6881  -2.903 0.006543 ** 
## propbelow.pred  -1.4172     0.6378  -2.222 0.033250 *  
## susp            -1.6357     0.5117  -3.197 0.003059 ** 
## liquor          -2.2095     0.4371  -5.055 1.57e-05 ***
## elheat           1.8265     0.4615   3.958 0.000379 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.162 on 33 degrees of freedom
## Multiple R-squared:  0.8089, Adjusted R-squared:   0.78 
## F-statistic: 27.94 on 5 and 33 DF,  p-value: 5.767e-11

Methods for Downscaling

  • Delta method: Here, we first find the model that fits the data best, using the aggregated data. Then:
    1. We predict what the life expectancy would be after we remove one of the blocks from the aggregated data; call this \[ T_{-b} = E(Y)_{-b} \]
    2. Then we find the delta in predicted life expectancy at the CSA level due to the removed block as \[ T_{\delta_b} = T_{full} - T_{-b} \] Call this delta the change in the mean life expectancy at the CSA level due to that block.
    3. Add the delta to the observed life expectancy at that CSA. Call this the predicted life expectancy due to that block. Note that this inherently assumes that the observed life expectancy at a CSA is the true underlying life expectancy and that all the blocks in the neighborhoods that belong to that CSA vary about it.
  • Transfer function: Find which aggregated predictors provide the best fit, then use a “transfer function” to map the aggregated variables to the block level, and use the resulting values as predictors to get block level estimates. For this scenario, I centered and scaled both the street block level data and the aggregated CSA level data, i.e., I subtracted the mean of each variable (taken over the whole dataset) from the variable and divided the centered variable by its standard deviation. My aim was to get estimates that were invariant to the fact that the CSA level data are an aggregated form of the street block level data. Then, using the GWR model above, I predicted (Gollini et al. 2015) the life expectancy for a given street block. That is, \[Y_{pred_i} = W_i^{1/2}X_{block}\beta_i\] Breaking this down, the predicted life expectancy for a given street block is a function of its distance from the CSA at location i (the median longitude and latitude of all blocks in that CSA), the observed street block values at that street block’s location, and the \(\beta\) value of the CSA at location i.
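
The three steps of the delta method can be sketched in a few lines of standard-library Python. This is one reading of those steps under stated assumptions: the CSA level predictor is taken to be the simple mean of its blocks, the fitted model is a linear predictor, and the coefficient, intercept, block values, and observed life expectancy below are all hypothetical rather than taken from the fitted model.

```python
def predict(coefs, intercept, features):
    """Linear prediction from a dict of coefficients and a dict of features."""
    return intercept + sum(coefs[k] * features[k] for k in coefs)


def aggregate(blocks):
    """CSA level value of each predictor, assumed to be the mean over blocks."""
    return {k: sum(b[k] for b in blocks) / len(blocks) for k in blocks[0]}


def delta_downscale(blocks, coefs, intercept, observed_csa_le):
    """Block level predictions: observed CSA value plus each block's delta."""
    t_full = predict(coefs, intercept, aggregate(blocks))
    predictions = []
    for i in range(len(blocks)):
        rest = blocks[:i] + blocks[i + 1:]          # step 1: drop block i
        t_minus = predict(coefs, intercept, aggregate(rest))
        delta = t_full - t_minus                    # step 2: change due to block i
        predictions.append(observed_csa_le + delta)  # step 3: add to observed
    return predictions


# Hypothetical CSA with three blocks and one predictor.
blocks = [{"prop_vacant": 0.1}, {"prop_vacant": 0.2}, {"prop_vacant": 0.6}]
block_le = delta_downscale(
    blocks, coefs={"prop_vacant": -10.0}, intercept=75.0, observed_csa_le=72.0
)
```

In this toy example the block with the highest vacancy pulls the CSA prediction down when it is included, so it receives a negative delta and a block level prediction below the observed CSA value. The sketch needs at least two blocks per CSA, since removing the only block leaves nothing to aggregate.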
## Fixed bandwidth: 10.35355 CV score: 219.045 
## Fixed bandwidth: 6.400124 CV score: 217.677 
## Fixed bandwidth: 3.956774 CV score: 224.921 
## Fixed bandwidth: 7.910197 CV score: 218.2388 
## Fixed bandwidth: 5.466847 CV score: 217.8346 
## Fixed bandwidth: 6.976921 CV score: 217.86 
## Fixed bandwidth: 6.043644 CV score: 217.6339 
## Fixed bandwidth: 5.823327 CV score: 217.6583 
## Fixed bandwidth: 6.179807 CV score: 217.6407 
## Fixed bandwidth: 5.95949 CV score: 217.6375 
## Fixed bandwidth: 6.095654 CV score: 217.6348 
## Fixed bandwidth: 6.0115 CV score: 217.6345 
## Fixed bandwidth: 6.06351 CV score: 217.634

The plot above compares the observed life expectancy in the testing dataset with the predictions from the two methods mentioned above. The grey polygons represent CSAs that are in the training dataset, while the colored regions represent CSAs that are in the testing dataset.

Datasets

| Name | Information | Type | Data Source | Geographic Scale | Date |
|---|---|---|---|---|---|
| Real Property Taxes | Contains information on which streets belong to which block and in what neighborhood, along with their longitude and latitude. Also has information on police district. | Dataset | Baltimore city website | Street level | 2016 |
| Real Property | Contains the City of Baltimore parcel boundaries, with ownership, address, valuation and other property information. It also contains street block definitions. | Dataset | Baltimore gisdata website | Street level | 2016 |
| Census Block | GIS shapefile which has information on census block designation for 2010 | Shapefile | Maryland Department of Planning | Block level | 2010 |
| Neighborhood | Polygon feature representing the boundaries of Baltimore City’s neighborhoods as of the year 2010 | Shapefile | Baltimore city website | Neighborhood level | 2010 |
| Census Demographics for 2010 to 2014 | Contains neighborhood level demographics data | Dataset | Baltimore Neighborhood Indicators Alliance BNIA-JF | Neighborhood level | 2010-2014 |
| Children and Family Health & Well-Being | Has information on life expectancy for 2010 to 2014 | Dataset | Baltimore Neighborhood Indicators Alliance BNIA-JF | Neighborhood level | 2010-2014 |
| BNIA Vital Signs Codebook | Contains information on short variable names and their corresponding full names, along with their sources for each dataset | Dataset | Baltimore city website | Neighborhood level | 2016 |
| Housing and Community Development | Has information on the state of households in Baltimore city, viz. Number of Homes Sold, Percentage of Residential Properties that are Vacant and Abandoned, Percent Residential Properties that do Not Receive Mail, etc. | Dataset | Baltimore Neighborhood Indicators Alliance BNIA-JF | Neighborhood level | 2010-2014 |
| BNIA Data linking CSA to Neighborhoods | Has information on which neighborhoods belong to which CSA. Note that a neighborhood may belong to more than one CSA. | Dataset | Baltimore Neighborhood Indicators Alliance BNIA-JF | CSA and Neighborhood level | 2010 |
| Census Bureau | Has information at the block group level. This includes information on family types, poverty status, and the median household income. | Dataset | American FactFinder | Census tract and Block group level | 2014 |

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/New_York            
##  date     2016-10-20
## Packages ------------------------------------------------------------------
##  package       * version  date       source        
##  ade4            1.7-4    2016-03-01 CRAN (R 3.3.1)
##  ape             3.5      2016-05-24 CRAN (R 3.3.1)
##  assertthat      0.1      2013-12-06 CRAN (R 3.3.1)
##  bitops          1.0-6    2013-08-17 CRAN (R 3.3.0)
##  boot            1.3-18   2016-02-23 CRAN (R 3.3.1)
##  broom         * 0.4.1    2016-06-24 CRAN (R 3.3.1)
##  coda            0.18-1   2015-10-16 CRAN (R 3.3.1)
##  colorspace      1.2-6    2015-03-11 CRAN (R 3.3.1)
##  cvTools       * 0.3.2    2012-05-14 CRAN (R 3.3.1)
##  DBI             0.4-1    2016-05-08 CRAN (R 3.3.1)
##  deldir          0.1-12   2016-03-06 CRAN (R 3.3.0)
##  DEoptimR        1.0-5    2016-07-01 CRAN (R 3.3.1)
##  devtools      * 1.12.0   2016-06-24 CRAN (R 3.3.1)
##  digest          0.6.9    2016-01-08 CRAN (R 3.3.1)
##  downloader    * 0.4      2015-07-09 CRAN (R 3.3.1)
##  dplyr         * 0.5.0    2016-06-24 CRAN (R 3.3.1)
##  evaluate        0.9      2016-04-29 CRAN (R 3.3.1)
##  FNN             1.1      2013-07-31 CRAN (R 3.3.1)
##  foreign         0.8-66   2015-08-19 CRAN (R 3.3.0)
##  formatR         1.4      2016-05-09 CRAN (R 3.3.1)
##  gdata           2.17.0   2015-07-04 CRAN (R 3.3.0)
##  geosphere       1.5-5    2016-06-15 CRAN (R 3.3.1)
##  ggmap         * 2.6.1    2016-01-23 CRAN (R 3.3.1)
##  ggplot2       * 2.1.0    2016-03-01 CRAN (R 3.3.1)
##  gmodels         2.16.2   2015-07-22 CRAN (R 3.3.1)
##  gstat         * 1.1-3    2016-03-31 CRAN (R 3.3.1)
##  gtable          0.2.0    2016-02-26 CRAN (R 3.3.1)
##  gtools          3.5.0    2015-05-29 CRAN (R 3.3.0)
##  GWmodel       * 1.2-5    2015-02-01 CRAN (R 3.3.1)
##  htmltools       0.3.5    2016-03-21 CRAN (R 3.3.1)
##  httr            1.2.0    2016-06-15 CRAN (R 3.3.1)
##  intervals       0.15.1   2015-08-27 CRAN (R 3.3.0)
##  jpeg            0.1-8    2014-01-23 CRAN (R 3.3.0)
##  knitr           1.13     2016-05-09 CRAN (R 3.3.1)
##  labeling        0.3      2014-08-23 CRAN (R 3.3.0)
##  lattice       * 0.20-33  2015-07-14 CRAN (R 3.3.1)
##  lazyeval        0.2.0    2016-06-12 CRAN (R 3.3.1)
##  LearnBayes      2.15     2014-05-29 CRAN (R 3.3.0)
##  lubridate       1.5.6    2016-04-06 CRAN (R 3.3.1)
##  magrittr        1.5      2014-11-22 CRAN (R 3.3.1)
##  mapproj         1.2-4    2015-08-03 CRAN (R 3.3.1)
##  maps            3.1.0    2016-02-13 CRAN (R 3.3.1)
##  maptools      * 0.8-39   2016-01-30 CRAN (R 3.3.1)
##  MASS            7.3-45   2016-04-21 CRAN (R 3.3.1)
##  Matrix        * 1.2-6    2016-05-02 CRAN (R 3.3.1)
##  memoise         1.0.0    2016-01-29 CRAN (R 3.3.1)
##  mnormt          1.5-4    2016-03-09 CRAN (R 3.3.0)
##  munsell         0.4.3    2016-02-13 CRAN (R 3.3.1)
##  nlme            3.1-128  2016-05-10 CRAN (R 3.3.1)
##  plyr          * 1.8.4    2016-06-08 CRAN (R 3.3.1)
##  png             0.1-7    2013-12-03 CRAN (R 3.3.0)
##  proto           0.3-10   2012-12-22 CRAN (R 3.3.0)
##  psych           1.6.6    2016-06-28 CRAN (R 3.3.1)
##  R6              2.1.2    2016-01-26 CRAN (R 3.3.1)
##  rappdirs        0.3.1    2016-03-28 CRAN (R 3.3.1)
##  RColorBrewer  * 1.1-2    2014-12-07 CRAN (R 3.3.0)
##  Rcpp            0.12.5   2016-05-14 CRAN (R 3.3.1)
##  RCurl           1.95-4.8 2016-03-01 CRAN (R 3.3.0)
##  readr         * 0.2.2    2015-10-22 CRAN (R 3.3.1)
##  readxl        * 0.1.1    2016-03-28 CRAN (R 3.3.1)
##  reshape2        1.4.1    2014-12-06 CRAN (R 3.3.1)
##  RevoUtils       10.0.1   2016-08-24 local         
##  RevoUtilsMath * 8.0.3    2016-04-13 local         
##  rgdal           1.1-10   2016-05-12 CRAN (R 3.3.1)
##  rgeos           0.3-19   2016-04-04 CRAN (R 3.3.1)
##  RgoogleMaps     1.2.0.7  2015-01-21 CRAN (R 3.3.1)
##  rjson           0.2.15   2014-11-03 CRAN (R 3.3.0)
##  RJSONIO         1.3-0    2014-07-28 CRAN (R 3.3.0)
##  rmarkdown       0.9.6    2016-05-01 CRAN (R 3.3.1)
##  robustbase    * 0.92-6   2016-05-31 CRAN (R 3.3.1)
##  scales          0.4.0    2016-02-26 CRAN (R 3.3.1)
##  sp            * 1.2-3    2016-04-14 CRAN (R 3.3.1)
##  spacetime       1.1-5    2015-12-26 CRAN (R 3.3.1)
##  spdep         * 0.6-5    2016-06-02 CRAN (R 3.3.1)
##  spgwr         * 0.6-28   2015-09-29 CRAN (R 3.3.1)
##  stringi         1.1.1    2016-05-27 CRAN (R 3.3.0)
##  stringr       * 1.0.0    2015-04-30 CRAN (R 3.3.1)
##  tibble          1.0      2016-03-23 CRAN (R 3.3.0)
##  tidyr           0.5.1    2016-06-14 CRAN (R 3.3.1)
##  tigris        * 0.3.3    2016-07-06 CRAN (R 3.3.1)
##  uuid            0.1-2    2015-07-28 CRAN (R 3.3.0)
##  withr           1.0.2    2016-06-20 CRAN (R 3.3.1)
##  XML           * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
##  xts             0.9-7    2014-01-02 CRAN (R 3.3.1)
##  yaml            2.1.13   2014-06-12 CRAN (R 3.3.1)
##  zoo             1.7-13   2016-05-03 CRAN (R 3.3.1)

References

“An Overarching Health Indicator for the Post-2015 Development Agenda.” 2014. http://www.who.int/healthinfo/indicators/hsi_indicators_SDG_TechnicalMeeting_December2015_BackgroundPaper.pdf.

“An Overarching Health Indicator for the Post-2015 Development Agenda.” n.d. http://hdr.undp.org/en/content/human-development-index-hdi.

Dutilleul, Pierre, Jason Stockwell, Dominic Frigon, and Pierre Legendre. 2000. “The Mantel Test Versus Pearson’s Correlation Analysis: Assessment of the Differences for Biological and Environmental Studies.” Journal of Agricultural, Biological, and Environmental Statistics 5 (June). International Biometric Society: 131–50. http://www.jstor.org/stable/1400528.

Fotheringham, A. Stewart, Chris Brunsdon, and Martin Charlton. 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley.

Gollini, Isabella, Binbin Lu, Martin Charlton, Christopher Brunsdon, and Paul Harris. 2015. “GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models.” Journal of Statistical Software 63 (February). http://dx.doi.org/10.18637/jss.v063.i17.

Mantel, Nathan. 1966. “The Detection of Disease Clustering and a Generalized Regression Approach.” American Association for Cancer Research, September.

Moran, Patrick Alfred Pierce. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37 (June). Oxford University Press on behalf of Biometrika Trust: 17–23. http://www.jstor.org/stable/2332142.